About data and models

Day 1

Manuele Bazzichetto

Premises

  • I am not a statistician! He was:
  • Intuition on stats concepts, abstractions and techniques used on a daily basis
  • Quick questions if you are completely lost!
    Big questions 👉 Thursday

Tell us about you

This story begins with..data



Data \(\approx\) Information



Usually represented in the form of continuous, discrete or qualitative measures taken on natural or experimental phenomena

An example..


PA  Elevation  PlantRichness  Habitat
0   302.69     4              Forest
0   299.80     3              Forest
0   294.75     5              Mixed
0   285.91     6              Mixed
0   281.66     4              Grassland
0   298.42     12             Mixed
1   307.54     2              Mixed
0   290.88     2              Forest
1   308.00     3              Mixed
0   314.91     1              Grassland

Data vs. reality

Data are just an incomplete part of the story

Usually, we want to know the full ‘truth’ (and not a part of it)

Think bigger..use a model


A MODEL is:

  • Simplification/approximation of reality
  • Set of assumptions describing the data-generating-process
  • A tool for making inference (drawing conclusions) about the World

(Ecological) Models..you never know which one to pick

From Ecological models and data in R - Bolker

All models are wrong, but some are useful

(cit. Box)

Our model(s) for today


Focus: parametric models


Ecologists use them for:

  • Estimation of parameters
  • Hypothesis testing
  • Regression models
  • Occupancy and species distribution models
  • Structural equation modelling

Example: let’s go back to the data

We are studying the body size of Gentoo penguins in Antarctica

We collect a bunch of data (a sample), which look like this 👇

ID  BodySize
1   5085.467
2   4983.132
3   4384.706
4   4773.966
5   5224.501
6   5272.518
7   4467.005
8   4892.681

Assumption: Body size of (all existing) Gentoo penguins is normally distributed with some mean and variance


Parametric model: \(Gentoo\hspace{1 mm}body\hspace{1 mm}size \sim \mathcal{N}(\mu,\, \sigma^{2})\)

What I think I am doing


Model:

\(\mu_i = \alpha + \beta \cdot flipper\hspace{1mm}length_i\)

What I am actually doing

Model:

\(Gentoo\hspace{1 mm}body\hspace{1 mm}size_i \sim \mathcal{N}(\mu_i,\, \sigma^{2})\)

\(\mu_i = \alpha + \beta \cdot flipper\hspace{1mm}length_i\)
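A minimal sketch of what "actually doing" means: the model is a data-generating process, so we can simulate from it and then try to recover its parameters. The values of `alpha`, `beta`, `sigma` and the flipper lengths below are hypothetical, chosen only for illustration; none of them come from the slides.

```python
import random

random.seed(1)

# Hypothetical parameter values (assumptions, not estimates from real data)
alpha, beta, sigma = -6800.0, 54.0, 350.0

# Hypothetical flipper lengths for 200 penguins
flipper = [random.uniform(205, 230) for _ in range(200)]

# Data-generating process: body_size_i ~ N(mu_i, sigma^2),
# with mu_i = alpha + beta * flipper_i
body_size = [random.gauss(alpha + beta * x, sigma) for x in flipper]

# Ordinary least squares estimates of alpha and beta
n = len(flipper)
x_bar = sum(flipper) / n
y_bar = sum(body_size) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(flipper, body_size))
sxx = sum((x - x_bar) ** 2 for x in flipper)
beta_hat = sxy / sxx
alpha_hat = y_bar - beta_hat * x_bar

# Residual standard deviation as an estimate of sigma
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(flipper, body_size)]
sigma_hat = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

print(f"beta: true {beta}, estimated {beta_hat:.1f}")
print(f"sigma: true {sigma}, estimated {sigma_hat:.1f}")
```

The straight line only describes the mean; the scatter around it is described by the Gaussian with standard deviation sigma.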

So what do we need data for?



Data \(\approx\) Information



We need data to estimate parameters
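As a small illustration, the eight body sizes from the Gentoo sample can be used to estimate the two parameters of the assumed Gaussian. This is only a sketch: it uses the sample mean and the sample variance (with the n - 1 denominator) as estimators, which is one common choice among others.

```python
import statistics

# Body sizes of the eight sampled penguins (values from the sample table)
body_size = [5085.467, 4983.132, 4384.706, 4773.966,
             5224.501, 5272.518, 4467.005, 4892.681]

# Sample statistics used as estimates of the population parameters
mu_hat = statistics.mean(body_size)          # estimate of mu
sigma2_hat = statistics.variance(body_size)  # estimate of sigma^2 (n - 1 denominator)
sigma_hat = statistics.stdev(body_size)      # estimate of sigma

print(f"mu_hat = {mu_hat:.1f}, sigma_hat = {sigma_hat:.1f}")
```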

Few things to keep in mind


  • Population: think about it as all elements of our study target (all existing Gentoo penguins)
  • Sample: a subset of the population obtained through sampling (a bunch of data)
  • Assumptions pertain to the population, not to the sample
  • Inference is made on population parameters, not on sample statistics (e.g., sample mean)
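The difference between a population parameter and a sample statistic can be sketched with a simulation. The population below is hypothetical (10,000 values drawn from a Gaussian with mean 5000 and sd 300, both made up for illustration); in real life we never see it in full.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 10,000 body sizes from N(mu = 5000, sigma = 300)
population = [random.gauss(5000, 300) for _ in range(10_000)]
mu = statistics.mean(population)  # population parameter (normally unknown)

# Draw 1000 samples of 8 individuals each and store every sample mean
sample_means = [statistics.mean(random.sample(population, 8))
                for _ in range(1000)]

# Sample means change from sample to sample, but scatter around mu
print(f"population mean: {mu:.1f}")
print(f"sample means range: {min(sample_means):.1f} to {max(sample_means):.1f}")
print(f"sd of sample means: {statistics.stdev(sample_means):.1f}")
```

Each sample mean is a statistic computed on a sample; the quantity we want to draw conclusions about is the single population mean.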

Bestiary of distributions: for continuous measures (aka pdf, probability density functions)

Uniform

\(X \sim \mathcal{U}(a,\,b),\quad \text{with } a = 1 \text{ and } b = 3\)

Pdf: \(f(x) = \begin{cases}\frac{1}{b-a} & \text{for } a \le x \le b, \\[8pt]0 & \text{for } x < a \ \text{ or } \ x > b.\end{cases}\)


  • All values between \(a\) and \(b\) have the same density
  • Used a lot as a prior for bounded parameters
  • It has a discrete analog
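The uniform pdf is simple enough to write by hand. The sketch below uses the slide's values (a = 1, b = 3) as defaults and checks numerically, with a plain midpoint rule, that the density integrates to 1; the function name is only illustrative.

```python
def uniform_pdf(x, a=1.0, b=3.0):
    """Density of U(a, b): constant 1 / (b - a) inside [a, b], 0 outside."""
    if a <= x <= b:
        return 1.0 / (b - a)
    return 0.0

# Midpoint-rule integration of the density over [0, 4]
n_steps = 4000
width = 4.0 / n_steps
area = sum(uniform_pdf((i + 0.5) * width) * width for i in range(n_steps))

print(uniform_pdf(2.0), uniform_pdf(0.5), round(area, 6))
```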

Gaussian (or normal)

\(X \sim \mathcal{N}(\mu,\,\sigma^{2}),\quad \text{with } \mu = 3 \text{ and } \sigma = 2\)

Pdf: \(f(x) = \frac{1}{\sigma \sqrt{2\pi} } e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\)


  • Symmetric
  • Proportion of observations falling within \(\pm\hspace{1mm}k\cdot\sigma\) of the mean is the same whatever the values of \(\mu\) and \(\sigma\)
  • Naturally arises under several conditions (sums and other transformations of random variables)
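A quick check of the second point: the share of a Gaussian lying within a given number of standard deviations from the mean does not depend on the mean or the sd. The sketch uses `NormalDist` from Python's standard library and compares the slide's example (mean 3, sd 2) with a standard normal; `prop_within` is a helper name introduced here.

```python
from statistics import NormalDist

def prop_within(k, mu, sigma):
    """Probability that X ~ N(mu, sigma^2) falls within k sd of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    # Same proportion for N(3, 2^2) and for N(0, 1)
    print(k, round(prop_within(k, 3, 2), 4), round(prop_within(k, 0, 1), 4))
```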

Standard normal

\(Z \sim \mathcal{N}(0,\,1)\)

\(Z = \frac{X - \mu}{\sigma},\quad \text{with } X\sim\mathcal{N}(\mu,\, \sigma^2)\)

Pdf: \(f(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}\)


  • ‘Scale’ Gaussian to have 0 mean and unit variance
  • Re-expresses a Gaussian variable as the number of population sd from the mean (Z-score)
  • Sum of squared independent \(Z\) is \(\chi^2\) distributed
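Standardising can be sketched in a couple of lines: simulate values from a Gaussian, subtract the population mean, divide by the population sd. The mean of 3 and sd of 2 are borrowed from the Gaussian example; the number of draws is arbitrary.

```python
import random
import statistics

random.seed(7)
mu, sigma = 3.0, 2.0  # population parameters (known here because we simulate)

x = [random.gauss(mu, sigma) for _ in range(50_000)]
z = [(xi - mu) / sigma for xi in x]  # Z-scores

z_mean = statistics.mean(z)
z_sd = statistics.pstdev(z)
print(f"mean of Z: {z_mean:.3f}, sd of Z: {z_sd:.3f}")
```

Up to simulation noise, the Z-scores have mean 0 and sd 1.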

Beta

\(X \sim Beta(\alpha,\,\beta),\quad \text{with shape parameters } \alpha > 0 \text{ and } \beta > 0\)

Pdf: Let it be..


  • Used a lot as a prior for parameters bounded between 0 and 1

Other common pdf(s)

  • Student's t (check suppl. mat.)
  • \(\chi^2\) distribution (check suppl. mat.)
  • F-distribution
  • Gamma (only for \(X \geq 0\))
  • Exponential

Bestiary of distributions: for discrete measures (aka pmf, probability mass functions)

Bernoulli

\(Y \sim Bern(p),\quad \text{with } Y \text{ assuming value } 0 \text{ or } 1\)

Pmf: \(Pr(Y) = p^Y(1-p)^{(1-Y)}\)


  • Special case of a Binomial distribution with a single trial (\(N = 1\))
  • \(Pr(Y=1)\) is \(p\) (prob. \(Y\) assuming value 1 is \(p\))
  • Assumed distribution for presence/absence data

Binomial

\(Y \sim Binomial(p,\, N),\quad \text{with } Y \text{ assuming integer values from } 0 \text{ to } N\)

Pmf: \(Pr(Y) = \binom{N}{Y} p^Y(1-p)^{(N-Y)}\)


  • Arises as the number of successes in \(N\) independent trials, each with probability \(p\)
  • Used to model counts with a known upper bound
  • Converges to Gaussian when N gets large
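The binomial pmf can be written directly from the formula using `math.comb`. In this sketch the values N = 10 and p = 0.3 are arbitrary; the checks are that the probabilities sum to 1, that the mean equals N times p, and that a single trial gives back the Bernoulli.

```python
from math import comb

def binomial_pmf(y, n_trials, p):
    """Pr(Y = y) = C(N, y) * p^y * (1 - p)^(N - y)."""
    return comb(n_trials, y) * p ** y * (1 - p) ** (n_trials - y)

n_trials, p = 10, 0.3  # arbitrary illustrative values
pmf = [binomial_pmf(y, n_trials, p) for y in range(n_trials + 1)]

total = sum(pmf)                                  # should be 1
mean = sum(y * pr for y, pr in enumerate(pmf))    # should be N * p
single_trial = binomial_pmf(1, 1, p)              # Bernoulli: Pr(Y = 1) = p

print(round(total, 6), round(mean, 6), single_trial)
```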

Poisson

\(Y \sim Pois(\lambda),\quad \text{with } Y \text{ assuming integer values} \geq 0\)

Pmf: \(Pr(Y) = \frac{\lambda^Y e^{-\lambda}}{Y!}\)


  • \(Mean = variance = \lambda\)
  • Limiting case of binomial with \(N\) large and \(p\) small
  • Used to model counts with no known upper bound
  • Converges to Gaussian as \(\lambda\) gets large
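Two of the points above can be checked numerically: mean equals variance, and a binomial with large N and small p is close to a Poisson. This is a sketch with an arbitrary rate of 4; the sums are truncated at 100, beyond which the probability is negligible for this rate.

```python
from math import comb, exp, factorial

def poisson_pmf(y, lam):
    """Pr(Y = y) = lam^y * e^(-lam) / y!"""
    return lam ** y * exp(-lam) / factorial(y)

lam = 4.0            # arbitrary illustrative rate
support = range(100)  # truncated support (an approximation)

mean = sum(y * poisson_pmf(y, lam) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

# Binomial with N large and p small, chosen so that N * p = lam
big_n = 10_000
small_p = lam / big_n
binom_3 = comb(big_n, 3) * small_p ** 3 * (1 - small_p) ** (big_n - 3)

print(round(mean, 6), round(var, 6))
print(round(binom_3, 5), round(poisson_pmf(3, lam), 5))
```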

Other common pmf(s)

  • Negative binomial
  • Geometric
  • Multinomial (a binomial for more than 2 outcomes)
  • Discrete uniform

Supplementary material

Other continuous pdf(s)

Student's t

\(T \sim t_{n},\quad \text{with } n \text{ degrees of freedom, mean } 0 \text{ and variance } \frac{n}{n-2} \text{ (defined only for } n > 2\text{)}\)


\(T = \frac{X - \mu}{\hat{\sigma}}\\,with\hspace{1mm}X\sim\mathcal{N}(\mu, \sigma^2)\)

Pdf:

  • Assumed distribution for the t-statistic
  • Re-expresses a Gaussian variable as the number of sample sd from the mean
  • Tightly linked to F-distribution
  • Heavier tails than the Gaussian when we have few data to estimate sd
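The effect of estimating the sd from few data can be sketched by simulation. Here each T divides a Gaussian deviation by a sample sd computed from only 5 independent observations (so 4 degrees of freedom); these numbers, and the cut-off of 3, are arbitrary choices for illustration.

```python
import random
import statistics
from statistics import NormalDist

random.seed(5)
mu, sigma = 0.0, 1.0
n_obs = 5  # observations used to estimate the sd (df = n_obs - 1)

def draw_t():
    """One draw of T = (X - mu) / sigma_hat, with sigma_hat from a small sample."""
    x = random.gauss(mu, sigma)
    sample = [random.gauss(mu, sigma) for _ in range(n_obs)]
    sigma_hat = statistics.stdev(sample)  # sample sd
    return (x - mu) / sigma_hat

t_values = [draw_t() for _ in range(20_000)]

# Share of values more than 3 units away from 0
tail_t = sum(abs(t) > 3 for t in t_values) / len(t_values)
tail_z = 2 * (1 - NormalDist().cdf(3))  # same share under a standard normal

print(f"Pr(|T| > 3) = {tail_t:.4f}, Pr(|Z| > 3) = {tail_z:.4f}")
```

Extreme values are far more frequent for T than for Z, which is what heavy tails mean in practice.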

Chi-squared

\(Q = \sum_{i=1}^{n}{Z_i^2}\)


\(Q \sim\chi^2_{n}\)

Pdf: 🙅‍♀️

  • Assumed distribution for the Chi-squared statistic
  • Mean is \(n\), variance is \(2\cdot{n}\)
  • F-distribution is \(\frac{\chi^2_{n_1}/n_1}{\chi^2_{n_2}/n_2}\), the ratio of two independent \(\chi^2\) variables each divided by its degrees of freedom
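The mean and variance of the Chi-squared can be checked by building it from its definition, a sum of squared independent standard normals. In this sketch the number of terms (5) and the number of simulated values are arbitrary.

```python
import random
import statistics

random.seed(11)
n = 5  # number of squared standard normals summed (degrees of freedom)

# Each Q is the sum of n squared independent draws from N(0, 1)
q = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(20_000)]

q_mean = statistics.mean(q)      # expected to be close to n
q_var = statistics.variance(q)   # expected to be close to 2 * n

print(f"mean of Q: {q_mean:.2f}, variance of Q: {q_var:.2f}")
```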